Homework 4¶

Maria Bochenek

Task A¶

Dataset¶

In this assignment we again use the QSAR-biodegradation dataset LINK for a binary classification task. The dataset has 17 features and 1055 rows. We changed the class labels from $\{1, 2\}$ to $\{0, 1\}$ for consistency with the procedure in HW1 (this was necessary to use BCELoss while training the neural network classifier). Class 0 has $699$ instances and class 1 has $356$, so the dataset's imbalance ratio is $1.96$. The task is to distinguish ready (class 1) from not ready (class 0) biodegradable molecules based on molecular structure descriptors.
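The imbalance ratio quoted above can be verified directly from the label counts. A minimal sketch, plugging in the reported counts rather than reloading the CSV (with the full dataframe loaded as in the Appendix, the same counts come from `df["TARGET"].value_counts()`):

```python
import pandas as pd

# Reported class counts: 699 "not ready" (class 0), 356 "ready" (class 1).
counts = pd.Series({0: 699, 1: 356})

imbalance_ratio = counts[0] / counts[1]
print(round(imbalance_ratio, 2))  # 1.96
```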

Models¶

We selected sklearn.ensemble.GradientBoostingClassifier, one of the better-performing models in the analysis from HW1. To extend the scope of the analysis we also included sklearn.linear_model.LogisticRegression as an example of a weaker model for comparison.

We set a random seed for reproducibility and split the dataset into training and validation sets in an 80:20 ratio. We performed feature scaling using sklearn.preprocessing.StandardScaler. Model performance is presented in the table below.

| model | recall | precision | f1 | accuracy | auc |
| --- | --- | --- | --- | --- | --- |
| GradientBoostingClassifier | 0.898305 | 0.688312 | 0.779412 | 0.857820 | 0.925680 |
| LogisticRegression | 0.762712 | 0.714286 | 0.737705 | 0.848341 | 0.899309 |

Selecting two observations for SHAP¶

Using the same seed, train/validation split, and scaling described above, we sampled two observations from the test set, one from each class. The selected observations are presented below.

| | V1 | V2 | V8 | V12 | V13 | V14 | V15 | V17 | V18 | V22 | V27 | V28 | V30 | V31 | V36 | V37 | V39 | TARGET |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 879 | 0.002859 | -0.451687 | 0.430836 | -1.560322 | 0.032203 | 0.699801 | 0.080149 | 0.292550 | -0.145830 | -0.376155 | 0.061527 | -0.008807 | 1.073682 | 0.254992 | -0.265870 | 0.648406 | -0.129319 | 1 |
| 1002 | 1.417468 | 1.877287 | -0.394870 | 0.264144 | 0.920539 | -0.263650 | 1.298176 | 0.630873 | 0.472339 | -0.244208 | 1.459517 | 0.003551 | 5.032002 | -0.599315 | 0.158288 | 1.103983 | 0.581106 | 0 |

The predictions for the selected observations are presented below.

| | 879 | 1002 |
| --- | --- | --- |
| TARGET | 1 | 0 |
| GradientBoostingClassifier | 1 | 0 |
| LogisticRegression | 1 | 0 |

Decomposition of predictions¶

Next, for the same observations ($879$, $1002$), we calculated the decomposition of predictions (variable attributions) using SHAP from the dalex and shap packages.

Decomposition using dalex package¶

GB_explain_dalex.png

Decomposition using shap package¶

explain_shap_obs_879.png

explain_shap_obs_1002.png

Observation comparison¶

We can see that the SHAP decomposition for the selected observations differs depending on which package is used. This is most likely because the Shapley values are not computed exactly but are approximated (e.g., dalex averages attributions over a sample of variable orderings), and the two packages use different estimation procedures.

However, it is important to note that only the three variables with the highest attribution for observation $879$ were the same (including the order) for both the dalex and shap packages; for observation $1002$ it was four variables. Additionally, both packages agree on the 7th, 8th, and 9th most important variables for observation $1002$. Moreover, for observation $1002$ both implementations indicate that the values of variables $V22$ and $V37$ have positive attribution toward the prediction being class 1, which is the incorrect class. (Ad Task A6)

The selected observations have different sets of top-3 most important variables. For observation $879$ (belonging to class 1) the three variables with the highest absolute attribution are, in order, $V12$, $V22$, and $V1$, while for observation $1002$ (belonging to class 0) they are $V36$, $V1$, and $V27$ (Ad Task A4).

Even though $V1$ is highly important for both observations, its attribution differs: positive for $879$ and negative for $1002$ (Ad Task A5). Also, $V1$ ranks higher for observation $1002$ (second) than for observation $879$ (third).

Model comparison¶

We compare sklearn.ensemble.GradientBoostingClassifier with sklearn.linear_model.LogisticRegression using the dalex package.

Decompositions of predictions for models are presented below (Ad Task A7).

explain_dalex_models.png

The decompositions of predictions for observations $879$ and $1002$ differ between the Gradient Boosting and Logistic Regression classifiers. Interestingly, only the second most important variable is the same for both models ($V22$ for observation $879$ and $V1$ for $1002$).

The variables that appear among the top 9 most important for both models (though in different positions) are:

  • $V37, V14, V30, V31$ for observation $879$,
  • $V2, V37, V18, V22$ for observation $1002$.

Moreover, for both models $V22$ and $V37$ have positive attribution for observation $1002$, while $V31$ has positive attribution under Logistic Regression but negative under Gradient Boosting for observation $879$.

Task B¶

Calculate Shapley values for player A given the following value function:

$$v(\emptyset) = 0, \quad v(A) = 20, \quad v(B) = 20, \quad v(C) = 60,$$

$$v(A,B) = 60, \quad v(A,C) = 70, \quad v(B,C) = 70, \quad v(A,B,C) = 100.$$

Shapley values are defined as

$$\phi_j = \frac{1}{|P|!} \sum_{\pi \in \Pi} \left( v(S_j^\pi \cup \{j\}) - v(S_j^\pi) \right),$$

where $\Pi$ is the set of all permutations of the players $P$ and $S_j^\pi$ is the coalition formed before player $j$ joins in permutation $\pi$.
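This definition can be transcribed directly into code: enumerate all $|P|!$ orderings of the players and average player $j$'s marginal contribution. A minimal sketch for the value function of this task (the frozenset keys are just a convenient encoding of coalitions):

```python
from itertools import permutations

players = ("A", "B", "C")
# Value function from the task, keyed by coalition.
v = {
    frozenset(): 0,
    frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
    frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley_by_permutations(j):
    """Average marginal contribution of player j over all join orders."""
    orders = list(permutations(players))
    total = 0
    for order in orders:
        before = frozenset(order[: order.index(j)])  # coalition S_j^pi
        total += v[before | {j}] - v[before]
    return total / len(orders)

print(shapley_by_permutations("A"))  # 25.0
```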

The formula for Shapley values can be rewritten as

$$\phi_j = \sum_{S \subseteq P \setminus \{j\} } \frac{ |S|! \, (|P|-|S|-1)!}{|P|!} \left( v(S \cup \{j\}) - v(S) \right).$$

We will use this formulation to compute $\phi_A$.

We have 3 players, so $|P| = 3$. Player A can join a coalition first, second, or third. When A joins first (i.e. joins the empty coalition) we have $S = \emptyset$, so A's contribution is $\frac{0!\,2!}{3!}(v(A)- v(\emptyset))$. When A joins second, there are two possible coalitions to join, $S=\{B\}$ or $S=\{C\}$, so A's contribution is $\frac{1!\,1!}{3!}(v(A, B)- v(B)) + \frac{1!\,1!}{3!}(v(A, C)- v(C))$. When A joins third, the only possible coalition is $S = \{B, C\}$, so A's contribution is $\frac{2!\,0!}{3!}(v(A, B, C)- v(B, C))$.

Thus we have

$$\begin{align*} \phi_A &= \sum_{S \subseteq P \setminus \{j\} } \frac{ |S|! \, (|P|-|S|-1)!}{|P|!} (v(S \cup \{j\}) - v(S)) \\ &= \frac{0!\,2!}{3!}(v(A)- v(\emptyset)) + \frac{1!\,1!}{3!}(v(A, B)- v(B)) + \frac{1!\,1!}{3!}(v(A, C)- v(C)) + \frac{2!\,0!}{3!}(v(A, B, C)- v(B, C)) \\ &= \frac{1}{3}(20-0) + \frac{1}{6}(60 - 20) + \frac{1}{6}(70 - 60) + \frac{1}{3}(100 - 70)\\ &= \frac{20}{3} + \frac{40}{6} + \frac{10}{6} + \frac{30}{3} \\ &=\frac{20+25+30}{3} \\ &= \frac{75}{3} \\ &= 25 \end{align*}$$

So the Shapley value for player A equals $25$, which is player A's average marginal contribution over all orders in which the grand coalition can be built in this coalition game.
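The hand calculation can be double-checked with a short script implementing the subset-weighted formulation above; this is a verification sketch, not part of the graded derivation. Exact arithmetic with `Fraction` avoids floating-point noise in the weights:

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

players = ("A", "B", "C")
# Value function from the task, keyed by coalition.
v = {
    frozenset(): 0,
    frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
    frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley(j):
    """phi_j via the subset-weighted formula."""
    n = len(players)
    others = [p for p in players if p != j]
    phi = Fraction(0)
    for size in range(n):
        for S in map(frozenset, combinations(others, size)):
            weight = Fraction(factorial(len(S)) * factorial(n - len(S) - 1),
                              factorial(n))
            phi += weight * (v[S | {j}] - v[S])
    return phi

print({p: shapley(p) for p in players})
# phi_A = phi_B = 25, phi_C = 50
```

As a sanity check, the values satisfy the efficiency property: $25 + 25 + 50 = v(A,B,C) = 100$.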

Appendix¶

In [2]:
import dalex as dx
import xgboost
import lime
import shap

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.compose import make_column_selector, make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, make_scorer, roc_auc_score, roc_curve, auc)
from sklearn.utils.class_weight import compute_sample_weight

import scipy.optimize as so
from numpy import interp  # scipy.interp and scipy.diag were removed from SciPy; use the numpy equivalents
from itertools import cycle

import random

import warnings
warnings.filterwarnings("ignore")

1. Train a tree-based ensemble model on the selected dataset.¶

It can be one of random forest, GBM, CatBoost, XGBoost, LightGBM (various types) etc.

Load dataset¶

In [8]:
url = "https://raw.githubusercontent.com/adrianstando/imbalanced-benchmarking-set/main/datasets/qsar-biodeg.csv"
df = pd.read_csv(url,index_col=0)
# change target labels 2->1 and 1->0
df["TARGET"] -= 1
display(df.head())
X = df.loc[:, df.columns != 'TARGET']
y = df["TARGET"]
V1 V2 V8 V12 V13 V14 V15 V17 V18 V22 V27 V28 V30 V31 V36 V37 V39 TARGET
0 3.919 2.6909 31.4 0.000 3.106 2.550 9.002 0.960 1.142 1.201 1.932 0.011 0.000 4.489 2.949 1.591 7.253 1
1 4.170 2.1144 30.8 0.000 2.461 1.393 8.723 0.989 1.144 1.104 2.214 -0.204 0.000 1.542 3.315 1.967 7.257 1
2 3.932 3.2512 26.7 0.000 3.279 2.585 9.110 1.009 1.152 1.092 1.942 -0.008 0.000 4.891 3.076 2.417 7.601 1
3 3.000 2.7098 20.0 0.000 2.100 0.918 6.594 1.108 1.167 1.024 1.414 1.073 8.361 1.333 3.046 5.000 6.690 1
4 4.236 3.3944 29.4 -0.271 3.449 2.753 9.528 1.004 1.147 1.137 1.985 -0.002 10.348 5.588 3.351 2.405 8.003 1

Train-test split¶

In [9]:
# Splitting data
SEED = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
scaler = StandardScaler().fit(X_train)
#scaling - transforming
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# convert back to dataframes
X_train = pd.DataFrame(X_train, columns = X.columns, index = y_train.index)
X_test = pd.DataFrame(X_test, columns = X.columns, index = y_test.index)

Train models¶

In [20]:
models = {
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=SEED, n_estimators=100),
    'LogisticRegression': LogisticRegression(random_state=SEED, penalty='l2', solver = 'lbfgs'),
    }

explainers = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    explainer = dx.Explainer(model, X_test, y_test)
    # save explainer
    explainers[name] = explainer
Preparation of a new explainer is initiated

  -> data              : 211 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 211 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000017CC6553A60> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00988, mean = 0.358, max = 0.983
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.926, mean = -0.0782, max = 0.934
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 211 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 211 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000017CC6553A60> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 9.77e-11, mean = 0.348, max = 0.98
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.968, mean = -0.0685, max = 0.994
  -> model_info        : package sklearn

A new explainer has been created!
In [21]:
performance = pd.concat([explainer.model_performance().result for explainer in explainers.values()], axis=0)
performance
Out[21]:
recall precision f1 accuracy auc
GradientBoostingClassifier 0.898305 0.688312 0.779412 0.857820 0.925680
LogisticRegression 0.762712 0.714286 0.737705 0.848341 0.899309

2. Select two observations from the dataset and calculate the model's prediction.¶

In [22]:
np.random.seed(SEED)
y_test_0 = y_test[y_test == 0]
y_test_1 = y_test[y_test == 1]

idx_0 = np.random.choice(y_test_0.index, 1)
idx_1 = np.random.choice(y_test_1.index, 1)
idx = np.concatenate((idx_1, idx_0))

obs_var = X_test.loc[idx]
obs_target = y_test[idx].to_frame()
obs_df = obs_var.join(obs_target)
display(obs_df)
V1 V2 V8 V12 V13 V14 V15 V17 V18 V22 V27 V28 V30 V31 V36 V37 V39 TARGET
879 0.002859 -0.451687 0.430836 -1.560322 0.032203 0.699801 0.080149 0.292550 -0.145830 -0.376155 0.061527 -0.008807 1.073682 0.254992 -0.265870 0.648406 -0.129319 1
1002 1.417468 1.877287 -0.394870 0.264144 0.920539 -0.263650 1.298176 0.630873 0.472339 -0.244208 1.459517 0.003551 5.032002 -0.599315 0.158288 1.103983 0.581106 0

Calculate model predictions¶

In [24]:
pred_df = obs_target.copy()
for name, model in models.items():
    pred_df[name] = model.predict(obs_var)
display(pred_df.T)
879 1002
TARGET 1 0
GradientBoostingClassifier 1 0
LogisticRegression 1 0

3. Next, for the same observations, calculate the decomposition of predictions, so-called variable attributions, using SHAP from two packages of choice.¶

E.g. for Python: dalex and shap, for R: DALEX and iml.

Explain predictions with DALEX¶

In [76]:
explainer = explainers['GradientBoostingClassifier']

shap_attributions = [explainer.predict_parts(obs_var.loc[[i]], 
                                             type="shap", 
                                             label=f'Observation {i}') 
                     for i in obs_var.index]

bd_attributions = [explainer.predict_parts(obs_var.loc[[i]], 
                                           type="break_down", 
                                           label=f'Observation {i}') 
                   for i in obs_var.index]    
In [79]:
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values for Gradient Boosting Classifier")
In [78]:
bd_attributions[0].plot(bd_attributions[1::], title="Break Down for Gradient Boosting Classifier") 

Explain predictions with shap package¶

In [67]:
model = models['GradientBoostingClassifier']
shap_explainer = shap.explainers.Tree(model, 
                                      data=X_test, 
                                      model_output="probability")
In [65]:
shap_values = shap_explainer(obs_var)
for i in range(len(shap_values)):
    plt.title(f"Observation {obs_var.index[i]}", fontsize = 15)
    shap.plots.waterfall(shap_values[i], show=True)
    plt.close()

4. Find any two observations in the dataset, such that they have different variables of the highest importance.¶

See above. Selected observations satisfy this condition.

5. Select one variable X and find two observations in the dataset such that for one observation, X has a positive attribution, and for the other observation, X has a negative attribution.¶

See above. Selected observations satisfy this condition.

6. (How) Do the results differ across the two packages selected in point (3)?¶

We can clearly see that the SHAP decomposition differs for the selected observations depending on which package is used. This is probably because the Shapley values are not computed exactly but are approximated, and each package uses a different estimation procedure.

However, it is important to note that the three variables with the highest attribution for observation 879 were the same for both the dalex and shap packages, while for observation 1002 it was four variables. Moreover, for observation 1002 both implementations indicate that the values of variables V22 and V37 have positive attribution toward the prediction being class 1, which is the incorrect class.

7. Train another model of any class: neural network, linear model, decision tree etc. and find an observation for which SHAP attributions are different between this model and the one trained in point (1).¶

We compare sklearn.ensemble.GradientBoostingClassifier with sklearn.linear_model.LogisticRegression using the dalex package.

In [69]:
explainer = explainers['LogisticRegression']

shap_attributions = [explainer.predict_parts(obs_var.loc[[i]], 
                                             type="shap", 
                                             label=f'Observation {i}') 
                     for i in obs_var.index]

bd_attributions = [explainer.predict_parts(obs_var.loc[[i]], 
                                           type="break_down", 
                                           label=f'Observation {i}') 
                   for i in obs_var.index]    
In [74]:
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values for Logistic Regression")
In [75]:
bd_attributions[0].plot(bd_attributions[1::], title="Break Down for Logistic Regression") 
In [109]:
shap_attributions = []
for i in obs_var.index:  
    for name in explainers:
        shap_attributions.append(explainers[name].predict_parts(obs_var.loc[[i]], 
                                                     type="shap", 
                                                     label=f'Observation {i} for {name}'))
In [114]:
shap_attributions[0].plot(shap_attributions[1::], title="Shapley Values - model comparison")
In [116]:
import plotly
plotly.offline.init_notebook_mode()